1) Introduction
2) Data Acquisition
2.1 LSE department staff
2.2 LSE research
3) Data Manipulation & Exploration
3.1 Loading the datasets
3.2 Initial exploration plot
3.3 Data manipulation
4) Data Analysis
4a. How does research productivity vary across departments?
4b. What are the factors affecting the average productivity?
- Department Size
- Research Staff Ratio
- Professor Ratio
- Dr Ratio
- External Collaborator Ratio
5) Conclusion
LSE is renowned for its research publications, particularly as a university specialising in social science. It was recognised by the 2021 Research Excellence Framework (REF) as the top university in the UK based on the proportion of world-leading (four-star) research outputs produced: 58% of all research produced by the university was world-leading (https://www.lse.ac.uk/News/Latest-news-from-LSE/2022/e-May-22/REF-2021-results). It follows that LSE devotes a significant share of its funding to research. Currently, this is done by allocating £3,000 per full-time equivalent to each eligible member of staff, according to the 2023-24 Departmental Funding Guidelines.
However, this may not be the most efficient way of allocating funds, as departments have different levels of research potential. We want to investigate research productivity across different departments within LSE. This information is not only useful for budgeting purposes but may also prove helpful to other parties. The analysis could help potential PhD students who are deciding between similar departments, such as Sociology and Anthropology, and want to use research productivity as the deciding factor. It could also help any student considering LSE as an institution to study at.
Apart from analysing the productivity differences, we also want to investigate potential root causes for any discrepancies. The factors we consider are the size of the department as measured by number of staff, the proportion of research staff, the proportion of Drs vs. Professors, and external collaboration.
We have been unable to find any similar analysis done on this topic with the focus seemingly more on the quality of research published in different departments in LSE. As LSE prides itself on its research, we hope this analysis provides some insight into topics that haven’t been considered before.
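We aim to answer the questions:
- How does research productivity vary across departments?
- What are the factors affecting the average productivity?
Note:
Due to time limitations and the complexity of the data acquisition, we will be focusing on 12 departments: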
Social Policy, Anthropology, Finance, Mathematics, Statistics, Psychological and Behavioural Science, International Relations, Management, Sociology, Geography and Environment, Economic History, Government.
To investigate specific departments, we require staff information for each department. This data is available in the staff section of each of LSE's department-specific pages. To convert the data into a format that we could analyse and manipulate, we used web scraping. Because some departments have very different webpage and HTML structures, we first scraped the departments with similar formats. As this yielded too few departments for a well-rounded analysis, we then scraped several more, giving us a total of 12 departments.
As the process is tedious and repetitive, we omitted it from this document and provided a separate notebook for the staff data acquisition, Data Acquisition - Departmental Staff Data via Webscraping.ipynb.
We stored each staff member's department, name, label, and title (Dr, Professor) in a dataframe and chose to omit PhD students, as neither the research publications nor the funding relate to them. Then, based on their label, we identified each staff member as research based or non-research based and put that category in a new column.
More details on how we acquired and cleaned the dataset and dealt with various problems can be found in the data acquisition notebook, with the final prepared dataset in the Data folder as a csv file called departmental_staff_data.csv.
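The label-to-category step described above could look roughly like the sketch below. The set of labels treated as research based, the "Non-Research" value, and the toy rows are assumptions for illustration only; the actual mapping is in the data acquisition notebook.

```python
import pandas as pd

# Hypothetical set of labels treated as research based; the real mapping
# is defined in the data acquisition notebook and may differ.
RESEARCH_LABELS = {"Academic staff", "Other academic and research staff"}

def categorise(label: str) -> str:
    """Return 'Research' for research-based labels, 'Non-Research' otherwise."""
    return "Research" if label in RESEARCH_LABELS else "Non-Research"

# Toy rows with placeholder names; "Professional services staff" is an invented label.
toy_staff = pd.DataFrame({
    "Name": ["Person A", "Person B", "Person C"],
    "Label": ["Academic staff", "Professional services staff",
              "Other academic and research staff"],
})
toy_staff["Category"] = toy_staff["Label"].map(categorise)
print(toy_staff)
```

Keeping the mapping in one explicit set makes it easy to review which labels count as research based.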
To explore research productivity, we need information about LSE research, which is available in the LSE research database here: https://eprints.lse.ac.uk/ . While this data can be web scraped, it is also available in JSON format, which is semi-structured and therefore much easier to convert to a dataframe and manipulate, so we used the JSON files to extract the information. All the required JSON files can be found under the Data folder, as well as the final csv file titled departmental_publications_data.csv.
For each publication, we stored the title, the department, the date, the names of the authors, the number of authors, and the number of authors who are LSE staff (identified by whether they had an LSE Institute ID). Getting this data was much more straightforward than web scraping; however, we have omitted the process from this document and provided a separate notebook for the research data acquisition, Data Acquisition - Publications per Department via JSON.ipynb. More details on how we dealt with the data can be found in that notebook.
For the web-scraped department staff information, we have already done some preliminary data manipulation inside the notebook. This is because some of the information obtained from web scraping is unnecessary or duplicated. More details can be found in Data Acquisition - Departmental Staff Data via Webscraping.ipynb.
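As a rough illustration of the per-publication fields described above, the sketch below flattens JSON records into a dataframe. The record layout (a `creators` list with a nested `name` and an optional `id`) is an assumption about the export format, and the records themselves are invented; the real processing is in the publications notebook.

```python
import json
import pandas as pd

# Invented records in an assumed layout; the real JSON export may differ.
raw = json.loads("""
[
  {"title": "Example paper one",
   "creators": [{"name": {"given": "Author", "family": "One"}, "id": "staff-001"},
                {"name": {"given": "Author", "family": "Two"}, "id": null}]},
  {"title": "Example paper two",
   "creators": [{"name": {"given": "Author", "family": "Three"}, "id": "staff-002"}]}
]
""")

def summarise(record: dict) -> dict:
    """Flatten one publication record into the columns used in this report."""
    creators = record.get("creators", [])
    names = [f"{c['name']['given']} {c['name']['family']}" for c in creators]
    # Assumption: an author counts as LSE staff when an ID is present.
    staff_authors = sum(1 for c in creators if c.get("id"))
    return {
        "Title": record.get("title"),
        "Authors": ", ".join(names),
        "NumberOfAuthors": len(creators),
        "NumberOfStaffAuthors": staff_authors,
    }

toy_publications = pd.DataFrame([summarise(r) for r in raw])
print(toy_publications)
```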
Here, we will use the csv files stored from the two data acquisition notebooks. First of all, let's take a look at the structure of the datasets.
The staff dataset contains all the staff members of the selected 12 departments, with their names, departments, labels, titles (Professor/Dr/neither), and whether they are research or non-research staff. The last column, Category, is derived from the labels. We will only use the Category column rather than the Label column, as there are too many unique labels. The code and details can be found in the notebook Data Acquisition - Departmental Staff Data via Webscraping.ipynb.
The publications dataset contains all the publications under the selected 12 departments that we obtained from the LSE research database website. For each publication, it contains the title, department, date of publication, author names, number of authors, and number of LSE authors.
import pandas as pd
import seaborn as sns
publications = pd.read_csv("Data/departmental_publications_data.csv")
staff = pd.read_csv("Data/departmental_staff_data.csv")
display(staff.tail())
display(publications.head())
| Name | Department | Label | Title | Category | |
|---|---|---|---|---|---|
| 1165 | Paul Willman | Management | Other academic and research staff | Professor | Research |
| 1166 | Mohamed Abouaziza | Management | Other academic and research staff | Dr | Research |
| 1167 | Anushri Gupta | Management | Other academic and research staff | Dr | Research |
| 1168 | Philipp Schoenegger | Management | Other academic and research staff | Dr | Research |
| 1169 | Oliver Seager | Management | Other academic and research staff | NaN | Research |
| Title | Department | Date | Authors | NumberOfAuthors | NumberOfStaffAuthors | |
|---|---|---|---|---|---|---|
| 0 | British incomes and property in the early nine... | Economic History | 01-12-1959 | Patrick O'Brien | 1 | 1 |
| 1 | National assistance: service or charity? | Social Policy | 01-01-1962 | Howard Glennerster | 1 | 1 |
| 2 | Twelve wasted years | Social Policy | 01-01-1963 | Howard Glennerster | 1 | 1 |
| 3 | Public schools | Social Policy | 01-01-1964 | Howard Glennerster | 1 | 1 |
| 4 | Man as tranducer for probabilities in Bayesian... | Management | 01-01-1964 | W. Edwards, Lawrence D. Phillips | 2 | 1 |
The date information we obtained from the JSON files is not very useful, since some of the dates are not accurate. As shown in the dataframe, if the exact day is missing in the JSON file, the date automatically becomes the first day of that month.
Instead of the full date, we decided to use the more reliable year information. We extract the year from the Date column as follows:
publications['Year'] = publications['Date'].str[-4:].astype(int)
publications.head(3)
| Title | Department | Date | Authors | NumberOfAuthors | NumberOfStaffAuthors | Year | |
|---|---|---|---|---|---|---|---|
| 0 | British incomes and property in the early nine... | Economic History | 01-12-1959 | Patrick O'Brien | 1 | 1 | 1959 |
| 1 | National assistance: service or charity? | Social Policy | 01-01-1962 | Howard Glennerster | 1 | 1 | 1962 |
| 2 | Twelve wasted years | Social Policy | 01-01-1963 | Howard Glennerster | 1 | 1 | 1963 |
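A slightly more defensive variant of the year extraction is sketched below. It assumes the dates follow a day-month-year layout (as the sample rows suggest) and turns anything that does not parse into a missing value instead of raising an error; the third toy value is invented to show that behaviour.

```python
import pandas as pd

# Toy values: two dates copied from the sample rows plus one invented bad entry.
dates = pd.Series(["01-12-1959", "01-01-1962", "not a date"])

# Assumption: dates are stored as DD-MM-YYYY; unparseable entries become NaT.
parsed = pd.to_datetime(dates, format="%d-%m-%Y", errors="coerce")

# Nullable integer dtype keeps missing years as <NA> rather than forcing floats.
years = parsed.dt.year.astype("Int64")
print(years)
```

Counting `years.isna().sum()` afterwards gives a quick check on how many records have unusable dates.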
To measure productivity, we want to use the total number of publications divided by the total number of staff for each department.
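Written out, the measure is the following ratio, where the symbols are our own shorthand: $P_d$ is the number of publications of department $d$ in the chosen period and $S_d$ is its number of staff.

```latex
\text{AverageProductivity}_d = \frac{P_d}{S_d}
```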
However, using the publication dataset directly can be problematic. The oldest publication in the dataset dates back to 1959, when not all departments had been established and perhaps not all publications were recorded.
In addition, the total number of staff in each department varies across years. Ideally, we should therefore focus only on publications from recent years, under the assumption that changes in the total number of staff in each department are negligible over that period.
To determine the time period to focus on, we use a line plot of publications across the decades to see from what point the publication pattern becomes stable.
Preparing dataframe for visualization
To visualize the data, we have to reorganize the original dataset into a form suitable for plotting. We use the groupby function on the original dataframe to count the total publications for each department and each year, filling department-year combinations with no publications with 0.
all_departments = publications['Department'].unique()
all_years = publications['Year'].unique()
DF=pd.DataFrame([(department, year) for department in all_departments for year in all_years],
columns=['Department', 'Year'])
TotalPub=publications.groupby(['Department', 'Year']).size().reset_index(name='Total Publications')
DF=pd.merge(DF, TotalPub, on=['Department', 'Year'], how='left').fillna(0)
DF=DF.sort_values(by=['Year','Department']).reset_index()
DF['Total Publications']=DF['Total Publications'].astype('int')
DF
| index | Department | Year | Total Publications | |
|---|---|---|---|---|
| 0 | 384 | Anthropology | 1959 | 0 |
| 1 | 0 | Economic History | 1959 | 1 |
| 2 | 640 | Finance | 1959 | 0 |
| 3 | 512 | Geography and Environment | 1959 | 0 |
| 4 | 448 | Government | 1959 | 0 |
| ... | ... | ... | ... | ... |
| 763 | 767 | Mathematics | 2024 | 17 |
| 764 | 639 | Psychological and Behavioural Science | 2024 | 44 |
| 765 | 127 | Social Policy | 2024 | 29 |
| 766 | 319 | Sociology | 2024 | 18 |
| 767 | 383 | Statistics | 2024 | 15 |
768 rows × 4 columns
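The same department-by-year grid can also be built without an explicit merge. The sketch below is an alternative under our own naming, run on toy data rather than the real dataset: it reindexes the grouped counts against the full product of departments and years, filling gaps with 0.

```python
import pandas as pd

# Toy publications: department B has nothing recorded in 2020.
toy = pd.DataFrame({
    "Department": ["A", "A", "B"],
    "Year": [2020, 2021, 2021],
})

counts = toy.groupby(["Department", "Year"]).size()

# Every (department, year) combination, including those with no publications.
full_index = pd.MultiIndex.from_product(
    [toy["Department"].unique(), sorted(toy["Year"].unique())],
    names=["Department", "Year"],
)

grid = (
    counts.reindex(full_index, fill_value=0)
    .rename("Total Publications")
    .reset_index()
)
print(grid)
```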
Visualization line plot
We then visualize the total publications in an interactive plot. We chose an interactive plot because the range slider allows us to zoom in on the years we are most interested in and to examine the changes and patterns closely.
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
fig = px.line(DF, x=DF.Year, y='Total Publications', color='Department',height=400)
fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(title='Total Publications Over Time',title_x=0.2)
fig
Observations
As the line plot shows, zooming in on more recent years reveals a major drop for all departments in 2019, around the time the pandemic began.
From 2020 onwards, total publications become more stable. The final drop in 2024 occurs because the year has only just started and only works from the first three months are recorded.
Considering these patterns, we decided to use the data from the 4 most recent full years, 2020-2023 (inclusive).
Using a shorter timeframe also ensures that the staff data we use is relevant as it is unlikely for each department to change their staff set up drastically enough to have a large impact on our analysis in just the last few years.
After determining the time periods we will be looking at (2020-2023), we then proceed to the investigation of departmentwise productivity differences and the factors affecting the productivity.
To investigate further, we need to prepare the relevant calculations and reorganize the datasets.
We currently have two datasets: publications and staff. We will first perform calculations on these datasets separately and then merge the results. For the publications dataset, we need the total publications and collaboration ratio for each department between 2020 and 2023. For the staff dataset, we need the total staff, proportion of research staff, proportion of Professors, and proportion of Drs for each department.
We found it more convenient to perform these calculations and the filtering through SQL queries on a database than on a pandas dataframe. Therefore, the following calculations and filtering are done in SQL.
Publication Dataset
To query the data with SQL, we first need to load the csv table into a database. This is done through sqlite3.
import sqlite3
conn = sqlite3.connect('Data/pubs.db')
publications.to_sql('pubs', conn, if_exists='replace', index=False)
conn.close()
For better showcasing the results, we chose to use sql magic to perform querying. We installed the sql magic and connected to the database.
%load_ext sql
%sql sqlite:///Data/pubs.db
To make sure the table was transferred to the database successfully, we checked that the total number of rows is correct and that the structure of the table matches the dataframe.
len(publications)
32348
%sql SELECT COUNT(*) FROM pubs;
| COUNT(*) |
|---|
| 32348 |
%sql SELECT * FROM pubs LIMIT 3;
| Title | Department | Date | Authors | NumberOfAuthors | NumberOfStaffAuthors | Year |
|---|---|---|---|---|---|---|
| British incomes and property in the early nineteenth century | Economic History | 01-12-1959 | Patrick O'Brien | 1 | 1 | 1959 |
| National assistance: service or charity? | Social Policy | 01-01-1962 | Howard Glennerster | 1 | 1 | 1962 |
| Twelve wasted years | Social Policy | 01-01-1963 | Howard Glennerster | 1 | 1 | 1963 |
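The same round-trip check can be done programmatically rather than by eye. The sketch below uses an in-memory SQLite database and a toy dataframe (both are stand-ins, not the report's files) and compares the row count and column names after the transfer.

```python
import sqlite3
import pandas as pd

# Toy dataframe standing in for the publications table.
toy = pd.DataFrame({"Title": ["a", "b"], "Year": [2020, 2021]})

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
toy.to_sql("pubs", conn, if_exists="replace", index=False)

# Row count and column names as seen by SQLite.
n_rows = conn.execute("SELECT COUNT(*) FROM pubs").fetchone()[0]
columns = [row[1] for row in conn.execute("PRAGMA table_info(pubs)")]
conn.close()

print(n_rows == len(toy), columns == list(toy.columns))
```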
We then use a SQL query to calculate the total publications and the ratio of externally collaborated publications.
In the query, the WHERE clause filters the data to 2020-2023; GROUP BY is used because we want aggregated data for each department; COUNT(*) gives the total number of publications; and AVG(CASE WHEN NumberOfAuthors>NumberOfStaffAuthors THEN 1 ELSE 0 END) gives the share of externally co-authored publications. The CASE expression evaluates to 1 when the number of authors exceeds the number of staff authors, that is, when at least one co-author comes from another institution, which we treat as external collaboration.
%config SqlMagic.displaylimit = 15
%%sql
SELECT Department,
COUNT(*) AS TotalPublications,
ROUND(AVG(CASE WHEN NumberOfAuthors>NumberOfStaffAuthors THEN 1 ELSE 0 END),2) AS CollabRatio
FROM pubs
WHERE 2020<=Year AND Year<=2023
GROUP BY Department
ORDER BY 1,2;
| Department | TotalPublications | CollabRatio |
|---|---|---|
| Anthropology | 205 | 0.39 |
| Economic History | 173 | 0.48 |
| Finance | 130 | 0.76 |
| Geography and Environment | 667 | 0.57 |
| Government | 426 | 0.42 |
| International Relations | 378 | 0.35 |
| Management | 428 | 0.71 |
| Mathematics | 261 | 0.77 |
| Psychological and Behavioural Science | 579 | 0.55 |
| Social Policy | 519 | 0.58 |
| Sociology | 219 | 0.37 |
| Statistics | 298 | 0.76 |
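For readers more familiar with pandas, the query above corresponds roughly to the sketch below. It runs on toy rows with the same column names, not on the real data, and `External` is a helper column introduced here for illustration.

```python
import pandas as pd

# Toy rows using the same column names as the pubs table.
toy_pubs = pd.DataFrame({
    "Department": ["X", "X", "Y", "Y"],
    "Year": [2019, 2021, 2020, 2023],
    "NumberOfAuthors": [2, 3, 1, 2],
    "NumberOfStaffAuthors": [1, 3, 1, 1],
})

# WHERE 2020 <= Year AND Year <= 2023
recent = toy_pubs[toy_pubs["Year"].between(2020, 2023)]

summary = (
    # External is True when at least one author is not LSE staff.
    recent.assign(External=recent["NumberOfAuthors"] > recent["NumberOfStaffAuthors"])
    .groupby("Department")
    .agg(TotalPublications=("External", "size"), CollabRatio=("External", "mean"))
    .round({"CollabRatio": 2})
    .reset_index()
)
print(summary)
```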
After obtaining the required information, we transformed the table into pandas dataframe, for later visualization.
%%sql result <<
SELECT Department,
COUNT(*) AS TotalPublications,
ROUND(AVG(CASE WHEN NumberOfAuthors>NumberOfStaffAuthors THEN 1 ELSE 0 END),2) AS CollabRatio
FROM pubs
WHERE 2020<=Year AND Year<=2023
GROUP BY Department
ORDER BY 1,2;
merge1=result.DataFrame()
merge1.head()
| Department | TotalPublications | CollabRatio | |
|---|---|---|---|
| 0 | Anthropology | 205 | 0.39 |
| 1 | Economic History | 173 | 0.48 |
| 2 | Finance | 130 | 0.76 |
| 3 | Geography and Environment | 667 | 0.57 |
| 4 | Government | 426 | 0.42 |
%sql --close sqlite:///Data/pubs.db
Staff dataset
After manipulating the first publication dataset, we then turn to the second staff dataset. The procedure is almost the same as the first publication dataset. We first need to convert the dataframe to a db and then use sql magic for querying.
conn = sqlite3.connect('Data/staff.db')
staff.to_sql('staff', conn, if_exists='replace', index=False)
conn.close()
%sql sqlite:///Data/staff.db
We then check if the dataset is successfully and completely transferred into db, by checking the number of rows, and the structure of the table.
len(staff)
1170
%sql SELECT COUNT(*) FROM staff;
| COUNT(*) |
|---|
| 1170 |
%sql SELECT * FROM staff LIMIT 3;
| Name | Department | Label | Title | Category |
|---|---|---|---|---|
| Fabio Battaglia | Social Policy | Academic staff | Dr | Research |
| Liam Beiser-McGrath | Social Policy | Academic staff | Dr | Research |
| Thomas Biegert | Social Policy | Academic staff | Dr | Research |
After the successful transfer, we perform the calculations through SQL queries. For this dataset, we need to calculate the total staff, proportion of research staff, Professor ratio, and Dr ratio.
The GROUP BY clause is used because we want aggregated data for each department; COUNT(*) gives the total number of staff; and AVG over a CASE WHEN expression gives the proportion of rows that satisfy the condition after WHEN.
%%sql
SELECT Department,
COUNT(*) AS TotalStaff,
ROUND(AVG(CASE WHEN Category='Research' THEN 1 ELSE 0 END),2) AS ResearchRatio,
ROUND(AVG(CASE WHEN Title='Dr' THEN 1 ELSE 0 END),2) AS DrRatio,
ROUND(AVG(CASE WHEN Title='Professor' THEN 1 ELSE 0 END),2) AS ProfRatio
FROM staff
GROUP BY Department;
| Department | TotalStaff | ResearchRatio | DrRatio | ProfRatio |
|---|---|---|---|---|
| Anthropology | 62 | 0.9 | 0.6 | 0.27 |
| Economic History | 66 | 0.65 | 0.29 | 0.35 |
| Finance | 95 | 0.41 | 0.31 | 0.18 |
| Geography and Environment | 128 | 0.59 | 0.23 | 0.18 |
| Government | 157 | 0.68 | 0.38 | 0.27 |
| International Relations | 100 | 0.75 | 0.48 | 0.21 |
| Management | 165 | 0.64 | 0.39 | 0.22 |
| Mathematics | 83 | 0.59 | 0.27 | 0.3 |
| Psychological and Behavioural Science | 108 | 0.76 | 0.52 | 0.17 |
| Social Policy | 76 | 0.75 | 0.46 | 0.29 |
| Sociology | 70 | 0.79 | 0.61 | 0.24 |
| Statistics | 60 | 0.87 | 0.55 | 0.27 |
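A pandas counterpart of the staff query is sketched below on toy rows. The helper columns `IsResearch`, `IsDr`, and `IsProf` are introduced here for illustration; a missing title simply counts as neither Dr nor Professor.

```python
import pandas as pd

# Toy rows with the same column names as the staff table.
toy_staff = pd.DataFrame({
    "Department": ["X", "X", "X", "Y"],
    "Title": ["Dr", "Professor", None, "Dr"],
    "Category": ["Research", "Research", "Non-Research", "Research"],
})

summary = (
    toy_staff.assign(
        IsResearch=toy_staff["Category"].eq("Research"),
        IsDr=toy_staff["Title"].eq("Dr"),          # missing titles compare as False
        IsProf=toy_staff["Title"].eq("Professor"),
    )
    .groupby("Department")
    .agg(
        TotalStaff=("IsResearch", "size"),
        ResearchRatio=("IsResearch", "mean"),
        DrRatio=("IsDr", "mean"),
        ProfRatio=("IsProf", "mean"),
    )
    .round(2)
    .reset_index()
)
print(summary)
```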
We then transformed the result to pandas dataframe, for later usage.
%%sql result <<
SELECT Department,
COUNT(*) AS TotalStaff,
ROUND(AVG(CASE WHEN Category='Research' THEN 1 ELSE 0 END),2) AS ResearchRatio,
ROUND(AVG(CASE WHEN Title='Dr' THEN 1 ELSE 0 END),2) AS DrRatio,
ROUND(AVG(CASE WHEN Title='Professor' THEN 1 ELSE 0 END),2) AS ProfRatio
FROM staff
GROUP BY Department;
merge2=result.DataFrame()
merge2.head()
| Department | TotalStaff | ResearchRatio | DrRatio | ProfRatio | |
|---|---|---|---|---|---|
| 0 | Anthropology | 62 | 0.90 | 0.60 | 0.27 |
| 1 | Economic History | 66 | 0.65 | 0.29 | 0.35 |
| 2 | Finance | 95 | 0.41 | 0.31 | 0.18 |
| 3 | Geography and Environment | 128 | 0.59 | 0.23 | 0.18 |
| 4 | Government | 157 | 0.68 | 0.38 | 0.27 |
%sql --close sqlite:///Data/staff.db
Merging tables
After the separate calculations, we merge the two tables into one, which makes it easier to visualize the data and conduct the investigation later.
The merged table below has everything we need to investigate the differences in productivity across departments and the reasons for those variations.
All the necessary manipulation and reshaping has now been done. We will then move to the analysis in Section 4.
DF2=pd.merge(merge1,merge2,how='left',on='Department')
DF2['AverageProductivity']=DF2['TotalPublications']/DF2['TotalStaff']
DF2['AverageProductivity']=DF2['AverageProductivity'].round(2)
DF2=DF2[['Department','TotalPublications','AverageProductivity','TotalStaff','ResearchRatio','ProfRatio','DrRatio','CollabRatio']]
DF2
| Department | TotalPublications | AverageProductivity | TotalStaff | ResearchRatio | ProfRatio | DrRatio | CollabRatio | |
|---|---|---|---|---|---|---|---|---|
| 0 | Anthropology | 205 | 3.31 | 62 | 0.90 | 0.27 | 0.60 | 0.39 |
| 1 | Economic History | 173 | 2.62 | 66 | 0.65 | 0.35 | 0.29 | 0.48 |
| 2 | Finance | 130 | 1.37 | 95 | 0.41 | 0.18 | 0.31 | 0.76 |
| 3 | Geography and Environment | 667 | 5.21 | 128 | 0.59 | 0.18 | 0.23 | 0.57 |
| 4 | Government | 426 | 2.71 | 157 | 0.68 | 0.27 | 0.38 | 0.42 |
| 5 | International Relations | 378 | 3.78 | 100 | 0.75 | 0.21 | 0.48 | 0.35 |
| 6 | Management | 428 | 2.59 | 165 | 0.64 | 0.22 | 0.39 | 0.71 |
| 7 | Mathematics | 261 | 3.14 | 83 | 0.59 | 0.30 | 0.27 | 0.77 |
| 8 | Psychological and Behavioural Science | 579 | 5.36 | 108 | 0.76 | 0.17 | 0.52 | 0.55 |
| 9 | Social Policy | 519 | 6.83 | 76 | 0.75 | 0.29 | 0.46 | 0.58 |
| 10 | Sociology | 219 | 3.13 | 70 | 0.79 | 0.24 | 0.61 | 0.37 |
| 11 | Statistics | 298 | 4.97 | 60 | 0.87 | 0.27 | 0.55 | 0.76 |
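When merging two per-department summaries, it can be worth guarding against duplicated or unmatched department names. The sketch below shows one way to do so on toy tables, using the `validate` and `indicator` options of `pd.merge`; the numbers are invented.

```python
import pandas as pd

# Toy per-department summaries standing in for merge1 and merge2.
left = pd.DataFrame({"Department": ["X", "Y"], "TotalPublications": [10, 20]})
right = pd.DataFrame({"Department": ["X", "Y"], "TotalStaff": [5, 4]})

merged = pd.merge(
    left, right,
    on="Department", how="left",
    validate="one_to_one",  # raises MergeError if a department appears twice
    indicator=True,         # adds a _merge column recording the match status
)

# Departments present in the publications table but missing from the staff table.
unmatched = merged[merged["_merge"] != "both"]

merged["AverageProductivity"] = (
    merged["TotalPublications"] / merged["TotalStaff"]
).round(2)
print(merged.drop(columns="_merge"))
print("unmatched departments:", len(unmatched))
```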
Based on the exploration and plots in the previous section, we found that some departments have a consistently high number of publications, for instance the Department of Geography and Environment. This may be because those departments are larger and have more staff.
Therefore, instead of looking at the overall publication numbers of each department, which are affected by department size, we now focus on productivity. We use the total number of publications from 2020 to 2023 divided by the total number of department staff as the productivity measure (assuming there are no major changes in the number of staff in these years).
The tables below are ordered by total publications and by average productivity respectively.
display(DF2[['Department','TotalPublications']].sort_values(by='TotalPublications',ascending=False).head())
display(DF2[['Department','AverageProductivity']].sort_values(by='AverageProductivity',ascending=False).head())
| Department | TotalPublications | |
|---|---|---|
| 3 | Geography and Environment | 667 |
| 8 | Psychological and Behavioural Science | 579 |
| 9 | Social Policy | 519 |
| 6 | Management | 428 |
| 4 | Government | 426 |
| Department | AverageProductivity | |
|---|---|---|
| 9 | Social Policy | 6.83 |
| 8 | Psychological and Behavioural Science | 5.36 |
| 3 | Geography and Environment | 5.21 |
| 11 | Statistics | 4.97 |
| 5 | International Relations | 3.78 |
We can see that although Geography and Environment has the most publications, it is not the most productive department. Conversely, Statistics is not among the top 5 departments by total publications, yet it ranks fourth of the 12 departments in average publications per staff member over the 4 years.
Departments with many publications sometimes also have competitive productivity, but the total is not a strong indicator of productivity.
This is shown by the relationship plot below.
from scipy.stats import linregress
import matplotlib.pyplot as plt
DF2=DF2.sort_index()
fig, ax = plt.subplots(figsize=(15, 5))
ax.scatter(DF2['TotalPublications'], DF2['AverageProductivity'], color='blue', alpha=0.5)
ax.set_title('Total Publications vs. Average Productivity by Department',fontsize=20,y=1.05)
ax.set_xlabel('Total Publications',fontsize=15)
ax.set_ylabel('Average Productivity',fontsize=15)
result = linregress(DF2['TotalPublications'], DF2['AverageProductivity'])
slope = result.slope
intercept = result.intercept
ax.plot(DF2['TotalPublications'], slope * DF2['TotalPublications'] + intercept, color='blue')
# Annotate each point with the department name
for j, txt in enumerate(DF2['Department']):
    # Give these three departments a different label offset to avoid overlapping annotations
    if txt in ['Anthropology', 'Government','Psychological and Behavioural Science']:
        ax.annotate(txt, (DF2['TotalPublications'][j], DF2['AverageProductivity'][j]),
                    xytext=(0, 3), textcoords='offset points', fontsize=11)
    else:
        ax.annotate(txt, (DF2['TotalPublications'][j], DF2['AverageProductivity'][j]),
                    xytext=(5, -5), textcoords='offset points', fontsize=11)
plt.show()
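To put a number on how strong this relationship is, the fitted regression also provides a correlation coefficient and a p-value. The sketch below recomputes the fit from the values in the merged table; with only 12 departments, the result should be read as indicative rather than conclusive.

```python
from scipy.stats import linregress

# Values copied from the merged table, in the same department order.
total_publications = [205, 173, 130, 667, 426, 378, 428, 261, 579, 519, 219, 298]
average_productivity = [3.31, 2.62, 1.37, 5.21, 2.71, 3.78,
                        2.59, 3.14, 5.36, 6.83, 3.13, 4.97]

fit = linregress(total_publications, average_productivity)

# rvalue is the Pearson correlation; pvalue tests the null hypothesis of zero slope.
print(f"slope={fit.slope:.4f}, r={fit.rvalue:.2f}, p={fit.pvalue:.3f}")
```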
Looking at individual departments, we can see that although Anthropology, Sociology, and Mathematics have similar productivity, their total publications differ. Mathematics has about 1.3 times as many publications as Anthropology (261 vs. 205).
Turning to productivity across departments, there is indeed considerable variation, ranging from nearly 7 publications per staff member in the most productive department to just over 1 in the least productive over 2020-2023.
Productivity variations as shown in the boxplot below
plt.figure(figsize=(8, 3))
sns.boxplot(data=DF2, x='AverageProductivity', color='lightblue')
plt.title('Boxplot Distribution of Average Productivity')
max_value = DF2['AverageProductivity'].max()
min_value = DF2['AverageProductivity'].min()
plt.annotate(f'Max: {max_value}', xy=(max_value, 0), xytext=(max_value-0.7, 0.1))
plt.annotate(f'Min: {min_value}', xy=(min_value, 0), xytext=(min_value+0.05, 0.1))
plt.show()
DF2.sort_values(by='AverageProductivity',inplace=True)
palette = sns.color_palette("cividis", len(DF2))
#plt.barh(y=DF2.Department, width=DF2['Average Productivity'],alpha=0.6, color=palette);
bars = plt.barh(y=DF2.Department, width=DF2['AverageProductivity'], alpha=0.6, color=palette)
for bar in bars:
    plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, f'{bar.get_width():.2f}', ha='left', va='center')
plt.title('Productivity Differences Across LSE Departments',y=1.05);
plt.show()
Exploring the productivity differences across departments, we can see that Social Policy is the most productive department, followed by three departments with similar productivity of around 5 publications per person: Psychological and Behavioural Science, Geography and Environment, and Statistics.
The rest of the departments perform similarly in terms of productivity, with the exception of Finance, whose productivity (1.37) is well below that of the next-lowest department (Management, 2.59). This is possibly due to the nature of the field, as Finance is more of an applied subject and is more closely tied to real-world applications than to academia.
The analysis in 4a indicates significant variations in average productivity across different departments. In this part, we aim to delve deeper into the specific factors that could account for these differences, that is, the specific factors that contribute to making an academic department more productive.
Combined with the data we have, we have identified 5 possible factors that may contribute to the departmental overall productivity:
- Department Size
- Research Staff Ratio
- Professor Ratio
- Dr Ratio
- External Collaborator Ratio
So, initially, we'll examine how these factors manifest within each department.
We will continue to utilize data from 2020 to 2023 (inclusive) for our analysis to maintain consistency.
import matplotlib.pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(1, 5, figsize=(40, 18))
# Subplot titles
titles = ['TotalStaff','ResearchRatio','ProfRatio','DrRatio','CollabRatio']
# Plot bar charts in each subplot
for i, (title, ax) in enumerate(zip(titles, axs)):
    # Sort by the specified column in ascending order so the largest bar is drawn at the top
    sorted_DF2 = DF2.sort_values(by=title, ascending=True)
    # Calculate the mean value
    mean_value = sorted_DF2[title].mean()
    # Plot bars and set colors based on whether the value is above or below the mean
    for department, value in zip(sorted_DF2['Department'], sorted_DF2[title]):
        if value > mean_value:
            color = 'darkblue'
        else:
            color = 'lightblue'
        bar = ax.barh(y=department, width=value, color=color, alpha=0.6)
        ax.text(value, bar[0].get_y() + bar[0].get_height() / 2, f'{value:.2f}', ha='left', va='center', fontsize=12)
    # Draw a dashed line for the mean value
    ax.axvline(x=mean_value, color='yellow', linestyle='--', linewidth=5)
    # Set x-axis limits
    if i == 0:
        ax.set_xlim(0, 200)
    else:
        ax.set_xlim(0, 1)
    ax.set_title(title, fontsize=33)
    ax.set_xlabel('') # Hide x-axis label
    ax.tick_params(axis='both', labelsize=20)
    ax.spines['top'].set_visible(False) # Hide top border
    ax.spines['right'].set_visible(False) # Hide right border
    ax.spines['left'].set_visible(False) # Hide left border
plt.subplots_adjust(top=0.88, bottom=0.1, left=0.1, right=0.95, wspace=0.3)
plt.suptitle("Rankings of Department under Each Factor", fontsize=40)
plt.show()
In the above plots, it's straightforward to pinpoint departments that excel compared to the majority. Additionally, we can compare the performance of different indicators within the same department. For instance, in the case of the Department of Management, despite having the highest TotalStaff count, the other four ratios either underperformed or showed only marginal improvement compared to other departments. This observation might help explain why its departmental overall productivity is relatively low.
Regarding the four ratios, we observe that the average ResearchRatio is notably higher than the ProfRatio and DrRatio. Particularly, the average ProfRatio is the lowest compared to ResearchRatio and DrRatio. This aligns with our common understanding that acquiring the title of Professor is typically more challenging.
Next, we'll proceed to estimate the effect of each factor on departmental overall productivity.
Using Pairplot and Heat Map allows us to examine the pairwise relationships and correlations, providing a rough understanding of the effect of each factor on departmental overall productivity.
Pairplot
import warnings
warnings.filterwarnings('ignore')
# Order the variables so that the dependent variable "AverageProductivity" is on the y-axis in the last row of the plot
ax=sns.pairplot(DF2[['TotalStaff','ResearchRatio','ProfRatio','DrRatio','CollabRatio','AverageProductivity']],
kind='reg', diag_kind='kde',corner=True)
ax.figure.set_size_inches(18,8)
Heat Map
import numpy as np
corrMatrix=DF2[['AverageProductivity','TotalStaff','ResearchRatio','ProfRatio','DrRatio','CollabRatio']].corr().round(2)
mask = np.triu(np.ones_like(corrMatrix, dtype=bool))[1:,:-1]
sns.heatmap(corrMatrix.iloc[1:,:-1], mask=mask, vmin=-1, vmax=1, center=0, cmap='coolwarm', linewidths=.5,
annot=True, square=True, annot_kws={"fontsize":8}, cbar_kws={"shrink":.8})
plt.xticks(rotation=30, ha='right');
We observe positive correlations between Departmental Overall Productivity and Research Staff Ratio as well as Dr Ratio, whereas the Department Size, ProfRatio, and CollabRatio exhibit negative correlations. Particularly, the correlations between Departmental Overall Productivity and ProfRatio, as well as CollabRatio, are negative but insignificant.
Specifically, we observe a strongly positive correlation between DrRatio and ResearchRatio. This correlation is likely attributable to the fact that staff with a "Dr" title constitute the majority of the Research Staff within the department.
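Since the heat map reports correlation coefficients only, one way to gauge which of them are distinguishable from zero is to compute a p-value for each. The sketch below does this with `scipy.stats.pearsonr` on the values from the merged table; with 12 observations the test has little power, so this is a rough check rather than a definitive result.

```python
import pandas as pd
from scipy.stats import pearsonr

# Values copied from the merged table, in the same department order.
table = pd.DataFrame({
    "AverageProductivity": [3.31, 2.62, 1.37, 5.21, 2.71, 3.78,
                            2.59, 3.14, 5.36, 6.83, 3.13, 4.97],
    "TotalStaff": [62, 66, 95, 128, 157, 100, 165, 83, 108, 76, 70, 60],
    "ResearchRatio": [0.90, 0.65, 0.41, 0.59, 0.68, 0.75,
                      0.64, 0.59, 0.76, 0.75, 0.79, 0.87],
    "ProfRatio": [0.27, 0.35, 0.18, 0.18, 0.27, 0.21,
                  0.22, 0.30, 0.17, 0.29, 0.24, 0.27],
    "DrRatio": [0.60, 0.29, 0.31, 0.23, 0.38, 0.48,
                0.39, 0.27, 0.52, 0.46, 0.61, 0.55],
    "CollabRatio": [0.39, 0.48, 0.76, 0.57, 0.42, 0.35,
                    0.71, 0.77, 0.55, 0.58, 0.37, 0.76],
})

factors = ["TotalStaff", "ResearchRatio", "ProfRatio", "DrRatio", "CollabRatio"]
results = {}
for factor in factors:
    # pearsonr returns the correlation coefficient and a two-sided p-value.
    r, p = pearsonr(table[factor], table["AverageProductivity"])
    results[factor] = (r, p)
    print(f"{factor:>13}: r={r:+.2f}, p={p:.3f}")
```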
(1) The Effect of Department Size on Productivity
We observe a slightly negative relationship between Departmental Overall Productivity and Department Size. However, we hypothesized that a larger department size implies a more extensive and comprehensive department, with higher management standards and a more diverse research field. This could potentially enhance the research productivity of staff, especially research staff, thereby increasing departmental overall productivity.
The deviation of our observed results from our initial hypothesis can be attributed to two main reasons.
Reason 1: Issues alongside our methodology for calculating average productivity
Department size positively influences total publications, but publications per capita tend to fall as the department grows. Because our dependent variable is TotalPublications divided by department size, average productivity will be negatively correlated with department size whenever total publications grow less than proportionally with the number of staff.
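This argument can be made concrete under a simplifying assumption that is ours and is not tested here: suppose total publications $P$ grow with department size $S$ as a power law with constants $a > 0$ and $0 < b < 1$. Then publications per staff member fall as $S$ grows, even though total publications rise.

```latex
P(S) = a\,S^{b}, \qquad
\frac{P(S)}{S} = a\,S^{\,b-1}, \qquad
\frac{d}{dS}\!\left(\frac{P}{S}\right) = a\,(b-1)\,S^{\,b-2} < 0
\quad \text{when } a > 0,\ 0 < b < 1 .
```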
Reason 2: The existence of confounders
As department size increases, the research staff ratio tends to decrease, because larger departments have a greater number of non-research staff. The lower research staff ratio in turn reduces departmental overall productivity.
If our hypothesis holds, this suggests that the impact of research staff ratio on departmental overall productivity is the stronger of the two, while the positive effect of department size is relatively weak and partially offset.
If we want to further investigate the influence of department size on departmental overall productivity, we need to control for research staff ratio.
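One possible way to control for research staff ratio is a partial correlation: regress both department size and average productivity on research staff ratio, then correlate the residuals. The sketch below is an assumption about how this could be done, not the method used in this report. The helper names are hypothetical and the data are synthetic.

```python
import numpy as np

def residuals(y, control):
    """Residuals of y after a least-squares fit on an intercept and `control`."""
    X = np.column_stack([np.ones_like(control), control])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def partial_corr(x, y, control):
    """Correlation between x and y after removing the linear effect of `control`."""
    x, y, control = (np.asarray(a, dtype=float) for a in (x, y, control))
    return np.corrcoef(residuals(x, control), residuals(y, control))[0, 1]

# Synthetic values for illustration only (NOT the LSE data):
# size falls with research ratio, productivity rises with it.
rng = np.random.default_rng(0)
ratio = rng.uniform(0.3, 0.8, size=12)
size = 150 - 100 * ratio + rng.normal(0, 10, size=12)
productivity = 1 + 5 * ratio + rng.normal(0, 0.5, size=12)

print("raw correlation:    ", np.corrcoef(size, productivity)[0, 1])
print("partial correlation:", partial_corr(size, productivity, ratio))
```

If the negative raw correlation shrinks towards zero or turns positive once research staff ratio is held fixed, that would be consistent with the confounding explanation given above.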
From the data, we can see that the Department of Geography and Environment and the Department of Mathematics have the same research staff ratio (0.59). However, Geography and Environment is considerably larger than Mathematics (128 vs. 83 staff), and its average productivity is also considerably higher (5.21 vs. 3.14).
display(DF2.loc[[3, 7], ['Department', 'AverageProductivity', 'TotalStaff', 'ResearchRatio']])
| | Department | AverageProductivity | TotalStaff | ResearchRatio |
|---|---|---|---|---|
| 3 | Geography and Environment | 5.21 | 128 | 0.59 |
| 7 | Mathematics | 3.14 | 83 | 0.59 |
(2) The Effect of Departmental Research Staff Ratio on Productivity
We observe a positive and significant relationship between Departmental Overall Productivity and Research Staff Ratio, aligning with our expectations.
Common sense suggests that research productivity is primarily driven by research staff who focus on research rather than teaching or administrative tasks. Therefore, the presence of research staff significantly influences a department's overall productivity. Consequently, departments with a higher ratio of research staff are likely to exhibit greater research productivity.
In addition to the internal factors affecting departmental average productivity, we also consider external effects. A department with a higher research staff ratio likely reflects a greater emphasis on the research field within that department. This heightened focus on research could potentially inspire non-research staff to become more involved in research activities. Consequently, this increased interest in research among non-research staff could contribute to enhancing overall departmental productivity.
(3) The Effect of Departmental Professor Ratio on Productivity
We observe an insignificant relationship between Departmental Overall Productivity and ProfRatio, which aligns with our initial hypothesis that the proportion of professors may not affect the departmental overall productivity significantly since professors are not the only personnel who are engaged in research activities.
(4) The Effect of Departmental Dr Ratio on Productivity
We observe a positive and relatively significant relationship between Departmental Overall Productivity and DrRatio. A likely explanation is that staff with a Dr title, who are capable of conducting research, make up a large proportion of each department and therefore have a larger impact on average productivity, whereas professors form only a small part of each department and have a minor impact.
(5) The Effect of Departmental External Collaborator Ratio on Productivity
We observe a negative but insignificant relationship between Departmental Overall Productivity and CollabRatio, which slightly contradicts our initial expectation. We speculated that having more external collaborators in a department would foster closer ties to academia, potentially stimulating internal productivity. However, the data do not support this assumption, suggesting that external collaborations may not significantly affect overall productivity. This discrepancy could be attributed to various confounders. For instance, it can be argued that publishing research is more difficult in Mathematics: despite having the highest CollabRatio, the Department of Mathematics performs below average in terms of productivity.
display(DF2.loc[[7], ['Department', 'AverageProductivity', 'CollabRatio']])
means = DF2[['AverageProductivity', 'CollabRatio']].mean()
print(means)
| | Department | AverageProductivity | CollabRatio |
|---|---|---|---|
| 7 | Mathematics | 3.14 | 0.77 |
AverageProductivity    3.751667
CollabRatio            0.559167
dtype: float64
This project aimed to conduct an in-depth analysis of research productivity at LSE across various departments. Using data primarily from 2020-2023 and applying data acquisition, cleaning, analysis, and visualisation skills, we explored potential factors contributing to departmental productivity and tried to understand the dynamics shaping these variations.
Our report found that the departments with the highest number of publications during the chosen period (Geography and Environment, Psychological and Behavioural Science, and Social Policy) do not necessarily rank highest in average productivity, measured as the average number of publications per staff member. On this measure, the Department of Social Policy ranks highest with 6.83 papers per staff member, while the Department of Finance ranks lowest with 1.37, indicating substantial variation between departments.
One might guess that this is due to the nature of the subjects themselves. However, further investigation into individual factors reveals that the proportion of research staff is positively correlated with average productivity, with a correlation coefficient of 0.46. Other factors seem to have negligible effects on average productivity.
Based on this, LSE could consider allocating more funding to the Department of Social Policy, as it has the highest research productivity per staff member, and should in future allocate research funds based not only on the number of staff members but also on their productivity. To promote productivity, our analysis suggests that LSE should focus on increasing the number of research-based staff members, particularly those with a Dr title, rather than simply increasing staff size or collaboration.
While we do reach a conclusion, there are several limitations of our analysis to consider, most stemming from imperfect data availability and the inferences we had to make from incomplete data. First, we noticed while gathering data that many of the professors in fact hold doctorates, so the distinction between Professor and Dr is not very clear and may simply reflect an arbitrary preferred title. Second, we only used data for 12 departments, while LSE has 27. Third, we assumed that there have been no major staff changes in the past four years, which may not hold for every department, particularly those which recently introduced new programmes. Fourth, our categorisation of staff members as research or non-research based could easily have misclassified a few of them. Finally, we measure productivity by the number of publications, as this is easy to rank and perform calculations on. This fails to consider that quality can be just as important a factor in productivity.
For further analysis, we would recommend first obtaining more accurate data and collecting data from more departments, ideally all departments within LSE. It might also be worthwhile to obtain staff information from previous years to make the analysis as accurate as possible. Another interesting question is how LSE compares with other universities, and whether the same factors affect their average productivity per department to the same extent.
As we are considering multiple factors when it comes to research productivity, multivariate regression would be a helpful tool, potentially using the econometric model

$$\mathit{AverageProductivity} = \alpha + \beta \times \mathit{TotalStaff} + \gamma \times \mathit{ResearchRatio} + \varepsilon$$

to solve the problem stated in 4b.3(1).
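As a minimal sketch of how this model could be estimated, the example below fits α, β and γ by ordinary least squares using NumPy. The function name `fit_ols` is hypothetical and the inputs are synthetic values rather than our data. With only 12 departments, any such estimates would be imprecise and should be read with caution.

```python
import numpy as np

def fit_ols(total_staff, research_ratio, avg_productivity):
    """Least-squares estimates (alpha, beta, gamma) for
    AverageProductivity = alpha + beta*TotalStaff + gamma*ResearchRatio + error."""
    total_staff = np.asarray(total_staff, dtype=float)
    research_ratio = np.asarray(research_ratio, dtype=float)
    y = np.asarray(avg_productivity, dtype=float)
    # Design matrix: a column of ones for the intercept, then the two regressors
    X = np.column_stack([np.ones_like(y), total_staff, research_ratio])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return tuple(coef)

# Synthetic values for illustration only (NOT the LSE data)
rng = np.random.default_rng(0)
staff = rng.integers(40, 140, size=12).astype(float)
ratio = rng.uniform(0.3, 0.8, size=12)
prod = 1.0 + 0.01 * staff + 4.0 * ratio + rng.normal(0, 0.5, size=12)

alpha, beta, gamma = fit_ols(staff, ratio, prod)
print(f"alpha = {alpha:.3f}, beta = {beta:.4f}, gamma = {gamma:.3f}")
```

With the notebook's DataFrame, the inputs would be `DF2['TotalStaff']`, `DF2['ResearchRatio']` and `DF2['AverageProductivity']`. The sign of β then indicates the direction of the department-size effect once research staff ratio is held fixed.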
LSE Staff Information per Department
Anthropology: https://www.lse.ac.uk/anthropology/people
Economic History: https://www.lse.ac.uk/Economic-History/People
Finance: https://www.lse.ac.uk/finance/people
Geography and Environment: https://www.lse.ac.uk/geography-and-environment/our-people
Government: https://www.lse.ac.uk/government/people
International Relations: https://www.lse.ac.uk/international-relations/people
Management: https://www.lse.ac.uk/management/people-home
Mathematics: https://www.lse.ac.uk/Mathematics/people
Psychological and Behavioural Science: https://www.lse.ac.uk/pbs/people
Social Policy: https://www.lse.ac.uk/social-policy/people
Sociology: https://www.lse.ac.uk/sociology/people
Statistics: https://www.lse.ac.uk/statistics/people
LSE Research Publications
https://eprints.lse.ac.uk/